New York City Airbnb Market Analysis: Seasonal Pricing Dynamics (2025)¶
EDA / Data Visualization Final Project¶
Student: Afaq Khan
Andrew ID: afaqk
Date: December 5th, 2025
Course: Exploratory Data Analysis (EDA) and Visualization with Python 90-800
Executive Summary¶
New York City welcomes 60+ million tourists annually. This analysis tracks how the Airbnb market responds to seasonal demand across four quarterly snapshots in 2025.
Central Hypothesis: "NYC Airbnb prices increase significantly from winter to summer, with Manhattan showing the strongest seasonal variation."
Key Finding: NYC operates as a two-tier market. Manhattan shows +34% seasonal price swings. Outer boroughs stay flat.
Dataset¶
- Source: Inside Airbnb (insideairbnb.com)
- Snapshots: Q1 (03_Jan), Q2 (01_Apr), Q3 (01_Jul), Q4 (01_Oct) 2025
- Clean Records: ~87,000 listings
- Coverage: All 5 NYC boroughs
1. Setup and Data Loading¶
# Importing required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium.plugins import MarkerCluster
import squarify
import warnings
warnings.filterwarnings('ignore')
# This helps with inline display in VSCode
from IPython.display import display, HTML
# Setting style for visualizations
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
# Displaying settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)
print("All packages imported successfully!")
All packages imported successfully!
# Loading all 4 quarterly datasets
df_jan = pd.read_csv('NYC_03 January 2025.csv.gz', compression='gzip')
df_apr = pd.read_csv('NYC_01 April 2025.csv.gz', compression='gzip')
df_jul = pd.read_csv('NYC_01 July 2025.csv.gz', compression='gzip')
df_oct = pd.read_csv('NYC_01 October 2025.csv.gz', compression='gzip')
print(f"Q1 - January 2025: {len(df_jan):,} listings")
print(f"Q2 - April 2025: {len(df_apr):,} listings")
print(f"Q3 - July 2025: {len(df_jul):,} listings")
print(f"Q4 - October 2025: {len(df_oct):,} listings")
Q1 - January 2025: 37,784 listings
Q2 - April 2025: 37,147 listings
Q3 - July 2025: 36,345 listings
Q4 - October 2025: 36,111 listings
# Adding season/quarter identifiers before combining
df_jan['season'] = 'Q1 Winter (Jan)'
df_jan['quarter'] = 'Q1'
df_jan['snapshot_date'] = '2025-01-03'
df_apr['season'] = 'Q2 Spring (Apr)'
df_apr['quarter'] = 'Q2'
df_apr['snapshot_date'] = '2025-04-01'
df_jul['season'] = 'Q3 Summer (Jul)'
df_jul['quarter'] = 'Q3'
df_jul['snapshot_date'] = '2025-07-01'
df_oct['season'] = 'Q4 Fall (Oct)'
df_oct['quarter'] = 'Q4'
df_oct['snapshot_date'] = '2025-10-01'
# Combining datasets (using common columns only)
common_cols = list(set(df_jan.columns) & set(df_apr.columns) & set(df_jul.columns) & set(df_oct.columns))
common_cols.extend(['season', 'quarter', 'snapshot_date'])
common_cols = list(set(common_cols))
df = pd.concat([df_jan[common_cols], df_apr[common_cols], df_jul[common_cols], df_oct[common_cols]], ignore_index=True)
print(f"\nCombined dataset: {len(df):,} rows × {df.shape[1]} columns")
print(f"\nRows by season:")
print(df['season'].value_counts().sort_index())
Combined dataset: 147,387 rows × 78 columns

Rows by season:
season
Q1 Winter (Jan)    37784
Q2 Spring (Apr)    37147
Q3 Summer (Jul)    36345
Q4 Fall (Oct)      36111
Name: count, dtype: int64
2. Data Exploration and Cleaning¶
# Initializing data exploration
print("=" * 60)
print("COMBINED DATASET STRUCTURE")
print("=" * 60)
df.info()
============================================================
COMBINED DATASET STRUCTURE
============================================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147387 entries, 0 to 147386
Data columns (total 78 columns); full per-column listing omitted. Notable fields: id, host_id, neighbourhood_group_cleansed, neighbourhood_cleansed, room_type, price (87,750 non-null), accommodates, bedrooms, beds, bathrooms, the review_scores_* block (~101k non-null), availability_30/60/90/365, and the added season, quarter, and snapshot_date columns.
dtypes: float64(22), int64(17), object(39)
memory usage: 87.7+ MB
# Checking missing values
missing = df.isnull().sum().sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_pct})
print("\nTop 15 columns with missing values:")
print(missing_df[missing_df['Missing Count'] > 0].head(15))
Top 15 columns with missing values:
Missing Count Percentage
calendar_updated 147387 100.00
license 125926 85.44
neighborhood_overview 69367 47.06
neighbourhood 69363 47.06
host_about 63820 43.30
host_response_time 61841 41.96
host_response_rate 61841 41.96
host_acceptance_rate 59662 40.48
price 59637 40.46
beds 59232 40.19
bathrooms 58828 39.91
review_scores_location 46054 31.25
review_scores_value 46051 31.24
review_scores_checkin 46043 31.24
review_scores_accuracy 46027 31.23
Cleaning Strategy¶
| Variable | Issue | Action |
|---|---|---|
| Price | 40% missing | Drop (inactive listings) |
| Price outliers | >$2000/night | Remove |
| Beds/bedrooms | Some missing | Impute using accommodates |
| Review scores | Missing for new listings | Keep as NaN |
| Booleans | t/f strings | Convert to True/False |
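The price conversion in this table is implemented below with a row-wise `clean_price` helper; the same parsing can also be done vectorized with pandas string methods. A minimal sketch on toy data (the values here are invented, not from the Inside Airbnb files):

```python
import pandas as pd

# Toy price column in the Inside Airbnb style: "$1,250.00" strings with gaps
raw = pd.DataFrame({'price': ['$1,250.00', '$85.00', None, '$199.00']})

# Strip '$' and thousands separators, then coerce to float; missing stays NaN
raw['price_clean'] = pd.to_numeric(
    raw['price'].str.replace(r'[$,]', '', regex=True), errors='coerce'
)
print(raw['price_clean'].tolist())
```

On ~147k rows the vectorized form is noticeably faster than `apply`, though both produce the same result.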
# =============================================================
# DATA QUALITY ASSESSMENT & CLEANING
# =============================================================
print(f"Before cleaning: {len(df):,} rows")
rows_before = len(df)
df = df[df['price'].notna()].copy()
print(f" After removing missing prices: {len(df):,} rows ({rows_before - len(df):,} removed)")
def clean_price(x):
    if pd.isna(x):
        return np.nan
    return float(str(x).replace('$', '').replace(',', ''))
df['price_clean'] = df['price'].apply(clean_price)
outliers = (df['price_clean'] > 2000).sum()
df = df[df['price_clean'] <= 2000].copy()
print(f" After removing price outliers (>$2000): {len(df):,} rows ({outliers} removed)")
if 'bedrooms' in df.columns:
    df['bedrooms'] = df.groupby('accommodates')['bedrooms'].transform(lambda x: x.fillna(x.median()))
    df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
if 'beds' in df.columns:
    df['beds'] = df['beds'].fillna(df['bedrooms'] if 'bedrooms' in df.columns else 1)
if 'bathrooms' in df.columns:
    df['bathrooms'] = df.groupby('accommodates')['bathrooms'].transform(lambda x: x.fillna(x.median()))
    df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].median())
print(f"\n✓ Imputed missing values for: bedrooms, beds, bathrooms")
df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')
df['first_review'] = pd.to_datetime(df['first_review'], errors='coerce')
df['snapshot_date'] = pd.to_datetime(df['snapshot_date'])
df['host_years'] = (df['snapshot_date'] - df['host_since']).dt.days / 365.25
def clean_percentage(x):
    if pd.isna(x):
        return np.nan
    return float(str(x).replace('%', ''))
if 'host_response_rate' in df.columns:
    df['host_response_rate_clean'] = df['host_response_rate'].apply(clean_percentage)
if 'host_acceptance_rate' in df.columns:
    df['host_acceptance_rate_clean'] = df['host_acceptance_rate'].apply(clean_percentage)
df['is_superhost'] = df['host_is_superhost'].map({'t': True, 'f': False})
df['is_instant_bookable'] = df['instant_bookable'].map({'t': True, 'f': False})
print(f"\n" + "=" * 60)
print("CLEANING COMPLETE")
print("=" * 60)
print(f"Final dataset: {len(df):,} rows × {df.shape[1]} columns")
print(f"\nBy season:")
print(df['season'].value_counts().sort_index())
df_analysis = df.copy()
print(f"\n✅ Ready for analysis with {len(df_analysis):,} clean listings!")
Before cleaning: 147,387 rows
 After removing missing prices: 87,750 rows (59,637 removed)
 After removing price outliers (>$2000): 87,050 rows (700 removed)

✓ Imputed missing values for: bedrooms, beds, bathrooms

============================================================
CLEANING COMPLETE
============================================================
Final dataset: 87,050 rows × 84 columns

By season:
season
Q1 Winter (Jan)    22899
Q2 Spring (Apr)    22050
Q3 Summer (Jul)    21091
Q4 Fall (Oct)      21010
Name: count, dtype: int64

✅ Ready for analysis with 87,050 clean listings!
# Graph 1: Price Distribution by Season
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
seasons = ['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)']
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']
for ax, season, color in zip(axes, seasons, colors):
    subset = df_analysis[df_analysis['season'] == season]['price_clean']
    sns.histplot(subset, bins=50, kde=True, color=color, edgecolor='black', alpha=0.7, ax=ax)
    mean_price = subset.mean()
    median_price = subset.median()
    ax.axvline(mean_price, color='darkred', linestyle='--', linewidth=2, label=f'Mean: ${mean_price:.0f}')
    ax.axvline(median_price, color='darkgreen', linestyle='-', linewidth=2, label=f'Median: ${median_price:.0f}')
    ax.set_xlabel('Price per Night ($)', fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.set_title(f'{season}\n(n={len(subset):,})', fontsize=12)
    ax.legend(fontsize=9)
    ax.set_xlim(0, 800)
    # Light gridlines
    ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.4, color='gray')
    ax.set_axisbelow(True)
plt.suptitle('NYC Airbnb Price Distribution by Season (2025)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print("\nMedian Price by Season:")
for season in seasons:
    median = df_analysis[df_analysis['season'] == season]['price_clean'].median()
    print(f" {season}: ${median:.0f}")
Median Price by Season:
 Q1 Winter (Jan): $132
 Q2 Spring (Apr): $144
 Q3 Summer (Jul): $150
 Q4 Fall (Oct): $151
Interpretation:
Prices are right-skewed. Most listings fall between $50-200/night. The distribution shifts rightward from Q1 to Q4:
- Q1 (Winter): Median $132
- Q4 (Fall): Median $151
- Change: +14.4%
The shift is real, but where is it coming from?
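Before drilling into boroughs, note that a citywide median can shift for two distinct reasons: listings getting pricier, or the pool of active listings tilting toward expensive boroughs. A toy illustration (invented numbers, not the real data) of how a mix shift alone moves the citywide median:

```python
import pandas as pd

# Toy data: two quarters, two boroughs (all numbers invented)
toy = pd.DataFrame({
    'quarter': ['Q1'] * 4 + ['Q4'] * 4,
    'borough': ['Manhattan', 'Manhattan', 'Queens', 'Queens',
                'Manhattan', 'Manhattan', 'Manhattan', 'Queens'],
    'price':   [160, 180, 95, 105,
                160, 180, 200, 100],
})

# Citywide median jumps from 132.5 to 170 ...
citywide = toy.groupby('quarter')['price'].median()
# ... yet per-borough medians barely move (Manhattan 170 -> 180, Queens flat):
# the Q4 pool simply contains proportionally more Manhattan listings
by_borough = toy.groupby(['borough', 'quarter'])['price'].median().unstack()
print(citywide)
print(by_borough)
```

This is why the following graphs condition on borough before drawing seasonal conclusions.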
Graph 2: Market Size by Borough¶
Purpose: Where are the listings concentrated?
# Graph 2: Bar Chart - Listings by Borough
fig, ax = plt.subplots(figsize=(12, 6))
oct_data = df_analysis[df_analysis['quarter'] == 'Q4']
borough_counts = oct_data['neighbourhood_group_cleansed'].value_counts()
colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6', '#f39c12']
bars = ax.bar(borough_counts.index, borough_counts.values, color=colors, edgecolor=colors)
for bar, val in zip(bars, borough_counts.values):
    pct = val / borough_counts.sum() * 100
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 100,
            f'{val:,}\n({pct:.1f}%)',
            ha='center', fontsize=11, fontweight='bold')
ax.set_xlabel('Borough', fontsize=12)
ax.set_title('NYC Airbnb Listings by Borough (October 2025)', fontsize=14)
# Remove y-axis
ax.set_ylabel('')
ax.set_yticks([])
# No grid
ax.grid(False)
# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.tight_layout()
plt.show()
print(f"\nManhattan and Brooklyn combined: {(borough_counts['Manhattan'] + borough_counts['Brooklyn']) / borough_counts.sum() * 100:.1f}% of all listings")
Manhattan and Brooklyn combined: 78.8% of all listings
Interpretation:
Two boroughs dominate:
- Manhattan: 9,106 listings (43%)
- Brooklyn: 7,458 listings (35%)
- Combined: 78% of the market
Queens holds 16%. Bronx and Staten Island together account for just 5%.
# Graph 3: Box Plot - Price by Borough and Season
fig, ax = plt.subplots(figsize=(14, 7))
borough_order = df_analysis.groupby('neighbourhood_group_cleansed')['price_clean'].median().sort_values(ascending=False).index
sns.boxplot(data=df_analysis, x='neighbourhood_group_cleansed', y='price_clean', hue='quarter',
order=borough_order, palette=['#3498db', '#2ecc71', '#e74c3c', '#f39c12'], ax=ax)
ax.set_xlabel('Borough', fontsize=12)
ax.set_ylabel('Price per Night ($)', fontsize=12)
ax.set_title('NYC Airbnb Prices by Borough: Quarterly Comparison', fontsize=14)
ax.legend(title='Quarter', loc='upper right')
ax.set_ylim(0, 600)
plt.tight_layout()
plt.show()
print("\nMedian Price by Borough and Quarter:")
pivot = df_analysis.pivot_table(values='price_clean', index='neighbourhood_group_cleansed',
columns='quarter', aggfunc='median').round(0)
pivot['Q1→Q4 Change'] = pivot['Q4'] - pivot['Q1']
pivot['% Change'] = ((pivot['Q4'] - pivot['Q1']) / pivot['Q1'] * 100).round(1)
print(pivot)
Median Price by Borough and Quarter:
quarter                          Q1      Q2      Q3      Q4  Q1→Q4 Change  % Change
neighbourhood_group_cleansed
Bronx                         90.00   90.00   93.00   94.00          4.00      4.40
Brooklyn                     120.00  124.00  128.00  130.00         10.00      8.30
Manhattan                    168.00  202.00  215.00  225.00         57.00     33.90
Queens                        99.00   98.00  101.00  103.00          4.00      4.00
Staten Island                100.00  100.00  100.00   99.00         -1.00     -1.00
Interpretation:
Here's the first hint of the central finding. Manhattan's boxes climb dramatically from Q1 to Q4, with medians rising from $168 to $225. The outer boroughs tell a different story: their boxes remain nearly identical across all four seasons.
This is the first visual evidence of two separate markets operating within one city.¶
Graph 4: Seasonal Price Trends by Borough — The Hero Chart¶
Purpose: Track price trajectories across the year. This is the most important visualization.
# Graph 4: Line Chart - Seasonal Price Trends by Borough
fig, ax = plt.subplots(figsize=(12, 6))
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
quarter_labels = ['Jan (Winter)', 'Apr (Spring)', 'Jul (Summer)', 'Oct (Fall)']
borough_colors = {'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71',
'Bronx': '#9b59b6', 'Staten Island': '#f39c12'}
for borough, color in borough_colors.items():
    borough_data = df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough]
    medians = [borough_data[borough_data['quarter'] == q]['price_clean'].median() for q in quarters]
    ax.plot(quarter_labels, medians, marker='o', linewidth=2, markersize=8,
            color=color, label=borough)
ax.set_xlabel('Quarter', fontsize=12)
ax.set_ylabel('Median Price ($)', fontsize=12)
ax.set_title('NYC Airbnb Seasonal Price Trends by Borough (2025)', fontsize=14)
ax.legend(title='Borough', loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Interpretation:
This is the key chart.
Manhattan's red line surges: $168 → $225 (+34%). Everyone else flatlines:
- Brooklyn: +8%
- Queens: +4%
- Bronx: +4%
- Staten Island: -1%
NYC isn't one Airbnb market; it's two. Manhattan operates like hotels, with dynamic seasonal pricing. The outer boroughs operate like apartments, with stable rents.
Graph 5: Price Heatmap — Borough × Quarter¶
Purpose: Summary matrix view of the two-tier finding.
# Graph 5: Heatmap - Price by Borough and Quarter
fig, ax = plt.subplots(figsize=(10, 6))
pivot_heatmap = df_analysis.pivot_table(values='price_clean',
index='neighbourhood_group_cleansed',
columns='quarter',
aggfunc='median')
pivot_heatmap = pivot_heatmap[['Q1', 'Q2', 'Q3', 'Q4']]
sns.heatmap(pivot_heatmap, annot=True, fmt='.0f', cmap='YlOrRd',
linewidths=0.5, cbar_kws={'label': 'Median Price ($)'}, ax=ax)
ax.set_xlabel('Quarter', fontsize=12)
ax.set_ylabel('Borough', fontsize=12)
ax.set_title('Median Nightly Price by Borough and Quarter (2025)', fontsize=14)
plt.tight_layout()
plt.show()
Interpretation:
The heatmap distills the story into one image. Manhattan's row burns red, deepening each quarter. Outer borough rows stay cool yellow.
The $57 gap between Manhattan Q1 ($168) and Q4 ($225) is the largest single-cell change.
Graph 6: Price Distribution Shape by Borough¶
Purpose: Compare the full shape of price distributions.
# Graph 6: Violin Plot - Price by Borough
fig, ax = plt.subplots(figsize=(12, 7))
# Borough order by median price
borough_order = df_analysis.groupby('neighbourhood_group_cleansed')['price_clean'].median().sort_values(ascending=False).index
# Consistent colors matching the other graphs
borough_palette = {
'Manhattan': '#e74c3c', # Red
'Brooklyn': '#3498db', # Blue
'Queens': '#2ecc71', # Green
'Bronx': '#9b59b6', # Purple
'Staten Island': '#f39c12' # Orange
}
# Create color list in borough_order
colors = [borough_palette[b] for b in borough_order]
sns.violinplot(data=df_analysis, x='neighbourhood_group_cleansed', y='price_clean',
order=borough_order, palette=colors, ax=ax, inner='quartile',
cut=0, # FIX: Don't extend beyond data range
width=0.7) # FIX: Shrink violin width
# Add median value labels
medians = df_analysis.groupby('neighbourhood_group_cleansed')['price_clean'].median()
for i, borough in enumerate(borough_order):
    median_val = medians[borough]
    ax.text(i, median_val + 15, f'${median_val:.0f}', ha='center', fontsize=10, fontweight='bold', color='black')
ax.set_xlabel('Borough', fontsize=12)
ax.set_ylabel('Price per Night ($)', fontsize=12)
ax.set_title('Price Distribution by Borough', fontsize=14)
ax.set_ylim(0, 500)
# Light dotted gridlines
ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.4, color='gray')
# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
Interpretation:
Manhattan's wide violin spans $50 to $500+. The outer boroughs are tight: most Bronx listings cluster around $90.
For price variety, Manhattan has it. For predictability, head to the outer boroughs.
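The variety-versus-predictability contrast can be made quantitative with a spread measure such as the interquartile range (IQR) of price per borough. A sketch on synthetic prices (borough names from the dataset; the distributions are invented, not the notebook's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'borough': ['Manhattan'] * 200 + ['Bronx'] * 200,
    'price': np.concatenate([
        rng.lognormal(mean=5.2, sigma=0.6, size=200),  # wide, right-skewed
        rng.normal(loc=90, scale=10, size=200),        # tight around $90
    ]),
})

# IQR = 75th minus 25th percentile, computed per borough
iqr = toy.groupby('borough')['price'].quantile([0.25, 0.75]).unstack()
iqr['IQR'] = iqr[0.75] - iqr[0.25]
print(iqr['IQR'].round(1))
```

Applied to the real `price_clean` column, the same groupby would put a number on how much wider Manhattan's distribution is than each outer borough's.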
# Graph 7: Manhattan Neighborhood Price Analysis
fig, ax = plt.subplots(figsize=(14, 8))
# Filter Manhattan data
manhattan_data = df_analysis[df_analysis['neighbourhood_group_cleansed'] == 'Manhattan']
# Get top 12 neighborhoods by listing count
top_neighborhoods = manhattan_data['neighbourhood_cleansed'].value_counts().head(12).index
# Filter to top neighborhoods
manhattan_top = manhattan_data[manhattan_data['neighbourhood_cleansed'].isin(top_neighborhoods)]
# Calculate median price by neighborhood and quarter
neighborhood_prices = manhattan_top.pivot_table(
values='price_clean',
index='neighbourhood_cleansed',
columns='quarter',
aggfunc='median'
).reindex(columns=['Q1', 'Q2', 'Q3', 'Q4'])
# Sort by Q4 price (highest first)
neighborhood_prices = neighborhood_prices.sort_values('Q4', ascending=True)
# Plot horizontal bar chart comparing Q1 vs Q4
y_pos = range(len(neighborhood_prices))
bars_q1 = ax.barh([y - 0.2 for y in y_pos], neighborhood_prices['Q1'],
height=0.4, label='Q1 Winter (Jan)', color='#3498db', edgecolor='#3498db')
bars_q4 = ax.barh([y + 0.2 for y in y_pos], neighborhood_prices['Q4'],
height=0.4, label='Q4 Fall (Oct)', color='#e74c3c', edgecolor='#e74c3c')
# Add value labels
for i, (q1, q4) in enumerate(zip(neighborhood_prices['Q1'], neighborhood_prices['Q4'])):
    ax.text(q1 + 5, i - 0.2, f'${q1:.0f}', va='center', fontsize=9)
    ax.text(q4 + 5, i + 0.2, f'${q4:.0f}', va='center', fontsize=9)
    # Add percentage change (signed format, so decreases show a minus sign)
    pct_change = ((q4 - q1) / q1) * 100
    ax.text(neighborhood_prices['Q4'].max() + 30, i, f'{pct_change:+.0f}%',
            va='center', fontsize=9, fontweight='bold', color='#27ae60')
ax.set_yticks(y_pos)
ax.set_yticklabels(neighborhood_prices.index)
ax.set_title('Manhattan Neighborhoods: Winter vs. Fall Pricing — Median $/Night (2025)', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', frameon=False)
# Remove x-axis
ax.set_xlabel('')
ax.set_xticks([])
# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# Light grid (horizontal only)
ax.grid(True, axis='x', linestyle=':', linewidth=0.5, alpha=0.4, color='gray')
plt.tight_layout()
plt.show()
# Print insights
print("\n Top 3 Seasonal Price Increases in Manhattan:")
changes = ((neighborhood_prices['Q4'] - neighborhood_prices['Q1']) / neighborhood_prices['Q1'] * 100).sort_values(ascending=False)
for neighborhood in changes.head(3).index:
print(f" {neighborhood}: +{changes[neighborhood]:.0f}%")
 Top 3 Seasonal Price Increases in Manhattan:
 Midtown: +58%
 Financial District: +56%
 Upper East Side: +43%
Interpretation:
Not all of Manhattan behaves the same.
Tourist hotspots surge:
- Midtown: +58% ($226 → $358)
- Financial District: +56% ($240 → $375)
- Upper East Side: +43% ($140 → $200)
Residential areas hold steady:
- Harlem: +10% ($105 → $115)
- Washington Heights: -1% ($80 → $79)
Seasonal pricing pressure is geographic — it concentrates in Midtown, the Financial District, and the Upper East/West Sides, while uptown residential neighborhoods like Harlem and Washington Heights remain stable year-round.
Graph 8: NYC's Most Expensive Neighborhoods¶
Purpose: Cross-borough neighborhood comparison.
# Lollipop Chart — Top 15 Most Expensive Neighborhoods (All NYC)
fig, ax = plt.subplots(figsize=(12, 10), dpi=100)
# Calculate median price by neighborhood (all boroughs)
neighborhood_prices = df_analysis.groupby(['neighbourhood_cleansed', 'neighbourhood_group_cleansed']).agg({
'price_clean': 'median',
'id': 'count'
}).reset_index()
neighborhood_prices.columns = ['Neighborhood', 'Borough', 'Median_Price', 'Listings']
# Filter to neighborhoods with at least 50 listings (avoid small sample noise)
neighborhood_prices = neighborhood_prices[neighborhood_prices['Listings'] >= 50]
# Get top 15 most expensive
top_15 = neighborhood_prices.nlargest(15, 'Median_Price').sort_values('Median_Price', ascending=True)
# Borough colors
borough_colors = {'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71',
'Bronx': '#9b59b6', 'Staten Island': '#f39c12'}
colors = [borough_colors[borough] for borough in top_15['Borough']]
# Plot lollipops
y_pos = range(len(top_15))
# Draw stems (lines)
ax.hlines(y=y_pos, xmin=0, xmax=top_15['Median_Price'], color=colors, linewidth=2)
# Draw lollipop heads (circles)
ax.scatter(top_15['Median_Price'], y_pos, color=colors, s=150, zorder=3)
# Add price labels
for i, (idx, row) in enumerate(top_15.iterrows()):
    ax.text(row['Median_Price'] + 5, i, f"${row['Median_Price']:.0f}",
            va='center', fontsize=10, fontweight='bold')
# Add neighborhood labels with borough in parentheses
labels = [f"{row['Neighborhood']} ({row['Borough'][:3]})" for idx, row in top_15.iterrows()]
ax.set_yticks(y_pos)
ax.set_yticklabels(labels, fontsize=10)
# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=borough) for borough, color in borough_colors.items()
if borough in top_15['Borough'].values]
ax.legend(handles=legend_elements, loc='lower right', title='Borough', fontsize=10, title_fontsize=11)
# Formatting
ax.set_xlabel('')
ax.set_xticks([])
ax.set_title("NYC's Most Expensive Neighborhoods — Where Are the Priciest Airbnbs? (2025)",
fontsize=14, fontweight='bold')
# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.set_xlim(0, top_15['Median_Price'].max() * 1.15)
plt.tight_layout()
plt.show()
# Print insight
manhattan_count = (top_15['Borough'] == 'Manhattan').sum()
print(f"\nTOP 15 MOST EXPENSIVE NEIGHBORHOODS:")
print(f" Manhattan dominates: {manhattan_count}/15 neighborhoods")
print(f" Most expensive: {top_15.iloc[-1]['Neighborhood']} (${top_15.iloc[-1]['Median_Price']:.0f}/night)")
TOP 15 MOST EXPENSIVE NEIGHBORHOODS:
 Manhattan dominates: 11/15 neighborhoods
 Most expensive: NoHo ($423/night)
Interpretation:
Manhattan claims 11 of the top 15 spots. Brooklyn sneaks in four (Vinegar Hill, DUMBO, Cobble Hill, Boerum Hill), but they barely match Manhattan's mid-tier areas.
NoHo and Tribeca top the list at $420+/night. Even Brooklyn's priciest (Vinegar Hill at $299) wouldn't crack Manhattan's top 10.
# Graph 9: Correlation Heatmap
numeric_cols = ['price_clean', 'accommodates', 'bedrooms', 'beds',
'number_of_reviews', 'review_scores_rating',
'availability_365', 'reviews_per_month', 'calculated_host_listings_count']
numeric_cols = [col for col in numeric_cols if col in df_analysis.columns]
corr_data = df_analysis[numeric_cols].dropna()
correlation_matrix = corr_data.corr()
fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f',
cmap='RdBu_r', center=0, square=True, linewidths=0.5,
cbar_kws={'shrink': 0.8}, ax=ax)
ax.set_title('Correlation Matrix: Key Listing Variables', fontsize=14)
plt.tight_layout()
plt.show()
print("\nKey Correlations with Price:")
price_corr = correlation_matrix['price_clean'].drop('price_clean').sort_values(key=abs, ascending=False)
print(price_corr.round(3))
Key Correlations with Price:
accommodates                      0.50
beds                              0.40
bedrooms                          0.38
calculated_host_listings_count    0.08
review_scores_rating              0.05
availability_365                  0.04
reviews_per_month                 0.04
number_of_reviews                -0.02
Name: price_clean, dtype: float64
Interpretation:
Capacity matters:
- Accommodates: r = 0.50
- Beds: r = 0.40
- Bedrooms: r = 0.38
Reviews don't:
- Number of reviews: r = -0.02
- Review scores: r = 0.05
Bigger spaces command higher prices. Popularity and ratings barely register.
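One caveat: Pearson's r on right-skewed prices can understate monotone relationships, so a rank-based (Spearman) check is a cheap robustness test. A sketch on synthetic data (the column names mirror the notebook, but the values are simulated, not the real listings):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
accommodates = rng.integers(1, 9, size=1000)
# Price grows multiplicatively with capacity, plus lognormal noise (invented model)
price = 60 * accommodates ** 0.7 * rng.lognormal(0, 0.4, size=1000)

toy = pd.DataFrame({'accommodates': accommodates, 'price_clean': price})
pearson = toy['accommodates'].corr(toy['price_clean'])                     # default: Pearson
spearman = toy['accommodates'].corr(toy['price_clean'], method='spearman')  # rank-based
print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

Running `df_analysis[numeric_cols].corr(method='spearman')` on the real data would show whether the capacity-price ranking holds once the skew is neutralized.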
Bonus: Superhost Pricing Analysis¶
Purpose: Do Superhosts charge more?
# Superhost vs Regular Host Analysis
superhost_premium = df_analysis.groupby(['is_superhost', 'quarter'])['price_clean'].median().unstack()
print("=" * 50)
print("SUPERHOST VS REGULAR HOST PRICING ANALYSIS")
print("=" * 50)
print("\n Superhost = Airbnb's top-tier hosts (4.8+ stars, 90%+ response, <1% cancellation)")
print("\nMedian Price by Quarter:")
print("-" * 50)
print(f"{'Host Type':<15} {'Q1':>8} {'Q2':>8} {'Q3':>8} {'Q4':>8}")
print("-" * 50)
print(f"{'Regular Host':<15} ${superhost_premium.loc[False, 'Q1']:>6.0f} ${superhost_premium.loc[False, 'Q2']:>6.0f} ${superhost_premium.loc[False, 'Q3']:>6.0f} ${superhost_premium.loc[False, 'Q4']:>6.0f}")
print(f"{'Superhost':<15} ${superhost_premium.loc[True, 'Q1']:>6.0f} ${superhost_premium.loc[True, 'Q2']:>6.0f} ${superhost_premium.loc[True, 'Q3']:>6.0f} ${superhost_premium.loc[True, 'Q4']:>6.0f}")
print("-" * 50)
avg_regular = superhost_premium.loc[False].mean()
avg_super = superhost_premium.loc[True].mean()
diff = avg_super - avg_regular
print(f"\nAverage (Regular Host): ${avg_regular:.0f}/night")
print(f"Average (Superhost): ${avg_super:.0f}/night")
print(f"\nKEY INSIGHT: Superhosts charge ${abs(diff):.0f}/night LESS than regular hosts")
print(" → Superhosts compete on VOLUME & VALUE, not premium pricing")
==================================================
SUPERHOST VS REGULAR HOST PRICING ANALYSIS
==================================================

 Superhost = Airbnb's top-tier hosts (4.8+ stars, 90%+ response, <1% cancellation)

Median Price by Quarter:
--------------------------------------------------
Host Type             Q1       Q2       Q3       Q4
--------------------------------------------------
Regular Host       $  133   $  150   $  154   $  153
Superhost          $  130   $  135   $  149   $  150
--------------------------------------------------

Average (Regular Host): $148/night
Average (Superhost): $141/night

KEY INSIGHT: Superhosts charge $6/night LESS than regular hosts
 → Superhosts compete on VOLUME & VALUE, not premium pricing
Interpretation:
Surprise! Superhosts charge $6/night LESS than regular hosts. They compete on volume and value, not premium pricing.
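The $6 gap is a point estimate from pooled snapshots. A bootstrap of the median difference (sketched here on synthetic prices; the notebook itself does not run this check) indicates whether such a gap exceeds sampling noise:

```python
import numpy as np

rng = np.random.default_rng(7)
# Invented lognormal prices with medians near the observed $148 vs $142
regular = rng.lognormal(mean=5.0, sigma=0.5, size=2000)
superhost = rng.lognormal(mean=4.96, sigma=0.5, size=2000)

# Resample each group with replacement and recompute the median gap 500 times
diffs = np.array([
    np.median(rng.choice(superhost, superhost.size))
    - np.median(rng.choice(regular, regular.size))
    for _ in range(500)
])
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"bootstrap 95% CI for the median gap: [{lo:.1f}, {hi:.1f}]")
```

If the interval computed on the real `price_clean` values excluded zero, the "Superhosts charge less" claim would rest on more than a single subtraction.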
Graph 10: Price vs Popularity¶
Purpose: Visual proof that reviews don't drive price.
# Graph 10: Scatter Plot - Price vs Reviews by Borough
fig, ax = plt.subplots(figsize=(14, 7))
scatter_data = df_analysis[df_analysis['number_of_reviews'] > 0]
scatter_data = scatter_data.sample(n=min(5000, len(scatter_data)), random_state=42)
borough_colors = {'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71',
'Bronx': '#9b59b6', 'Staten Island': '#f39c12'}
for borough, color in borough_colors.items():
    subset = scatter_data[scatter_data['neighbourhood_group_cleansed'] == borough]
    ax.scatter(subset['number_of_reviews'], subset['price_clean'],
               c=color, alpha=0.5, s=20, label=borough, edgecolor='none')
ax.set_xlabel('Number of Reviews', fontsize=12)
ax.set_ylabel('Price per Night ($)', fontsize=12)
ax.set_title('Price vs. Popularity by Borough', fontsize=14)
ax.set_xlim(0, scatter_data['number_of_reviews'].quantile(0.95))
ax.set_ylim(0, 600)
# FIXED: Move legend outside plot area
ax.legend(title='Borough', bbox_to_anchor=(1.02, 1), loc='upper left', frameon=True)
plt.tight_layout()
plt.show()
Interpretation:
A cloud with no pattern. High-priced listings have few reviews. Low-priced listings have hundreds.
Notice the colors: Manhattan dots (red) float above everyone else regardless of review count. Location trumps popularity.
Graph 11: Price Distribution by Room Type¶
Purpose: How do room types differ in pricing?
# Graph 11: KDE Plot by Room Type
fig, ax = plt.subplots(figsize=(12, 6))
room_types = df_analysis['room_type'].unique()
colors = sns.color_palette('Set2', len(room_types))
for room, color in zip(room_types, colors):
    subset = df_analysis[df_analysis['room_type'] == room]['price_clean']
    if len(subset) > 50:
        sns.kdeplot(subset, ax=ax, label=f"{room} (n={len(subset):,})",
                    color=color, linewidth=2, fill=True, alpha=0.2)
ax.set_xlabel('Price per Night ($)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Price Distribution by Room Type', fontsize=14)
ax.set_xlim(0, 600)
ax.legend(title='Room Type')
# Light dotted gridlines
ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.5, color='gray')
# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
Interpretation:
Clear separation:
- Private rooms: Peak around $60-75
- Entire homes: Peak around $125-150, long tail to $500+
- Shared rooms: Cluster at the bottom
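The peaks read off the KDE can be cross-checked with per-room-type summary statistics. A minimal sketch with toy listings (hypothetical prices):

```python
import pandas as pd

# Toy listings (hypothetical) illustrating the per-room-type summary
df = pd.DataFrame({
    'room_type': ['Entire home/apt'] * 4 + ['Private room'] * 4 + ['Shared room'] * 2,
    'price_clean': [140.0, 125.0, 300.0, 150.0, 70.0, 65.0, 60.0, 75.0, 40.0, 35.0],
})

# Median and range per room type back the peaks seen in the KDE plot
summary = df.groupby('room_type')['price_clean'].agg(['median', 'min', 'max'])
print(summary)
```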
Graph 12: Room Type Mix by Season¶
Purpose: Does market composition shift with the seasons?
# Graph 12: Stacked Bar (Horizontal) - Room Type by Season
fig, ax = plt.subplots(figsize=(12, 6))
# Create crosstab
crosstab = pd.crosstab(df_analysis['season'], df_analysis['room_type'], normalize='index') * 100
crosstab = crosstab.reindex(['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)'])
# Distinct colors for each room type
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12'] # Blue, Green, Red, Orange
# Plot horizontal stacked bar
crosstab.plot(kind='barh', stacked=True, ax=ax, color=colors,
edgecolor='white', linewidth=1, width=0.7)
# Add percentage labels
for container_idx, container in enumerate(ax.containers):
for bar in container:
width = bar.get_width()
if width > 3: # Only label segments > 3%
x_pos = bar.get_x() + width / 2
y_pos = bar.get_y() + bar.get_height() / 2
ax.text(x_pos, y_pos, f'{width:.1f}%', ha='center', va='center',
fontsize=10, fontweight='bold', color='white')
# Remove grid
ax.grid(False)
# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# Remove x-axis
ax.set_xticks([])
ax.set_xlabel('')
ax.set_ylabel('Season', fontsize=12, fontweight='bold')
ax.set_title('Room Type Distribution by Season', fontsize=14, fontweight='bold')
ax.legend(title='Room Type', bbox_to_anchor=(1.02, 1), loc='upper left', frameon=False)
plt.tight_layout()
plt.show()
Interpretation:
The bars look identical — and that's the point. Market composition stays frozen:
- ~56% Entire homes
- ~42% Private rooms
- ~2% Hotel/Shared rooms
This indicates that seasons change prices, not structure.
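The stability check behind Graph 12 is a row-normalized crosstab: if each season's row shows the same percentages, composition is frozen. A toy sketch (hypothetical listings, two seasons only):

```python
import pandas as pd

# Toy quarters (hypothetical) with identical room-type mixes
df = pd.DataFrame({
    'season': ['Q1'] * 5 + ['Q3'] * 5,
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Entire home/apt',
                  'Private room', 'Private room'] * 2,
})

# Row-normalized crosstab: each row sums to 100%
share = pd.crosstab(df['season'], df['room_type'], normalize='index') * 100
print(share.round(1))

# Identical rows across seasons => stable composition
print("Composition stable:", bool((share.loc['Q1'] == share.loc['Q3']).all()))
```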
Graph 13: Market Flow (Borough → Room Type → Price Tier)¶
Purpose: Visualize how listings flow through the market.
# Advanced Graph: Sankey Diagram — Borough → Room Type → Price Tier
import plotly.graph_objects as go
# Create price tiers
def price_tier(price):
if price < 75:
return 'Budget (<$75)'
elif price < 150:
return 'Mid-Range ($75-150)'
elif price < 300:
return 'Premium ($150-300)'
else:
return 'Luxury ($300+)'
# Use Q4 data
q4_data = df_analysis[df_analysis['quarter'] == 'Q4'].copy()
q4_data['price_tier'] = q4_data['price_clean'].apply(price_tier)
# Count flows: Borough → Room Type → Price Tier
flow_data = q4_data.groupby(['neighbourhood_group_cleansed', 'room_type', 'price_tier']).size().reset_index(name='count')
# Create node labels
boroughs = ['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island']
room_types = ['Entire home/apt', 'Private room', 'Hotel room', 'Shared room']
price_tiers = ['Budget (<$75)', 'Mid-Range ($75-150)', 'Premium ($150-300)', 'Luxury ($300+)']
all_nodes = boroughs + room_types + price_tiers
node_indices = {node: i for i, node in enumerate(all_nodes)}
# Define node colors
node_colors = (
['#e74c3c', '#3498db', '#2ecc71', '#9b59b6', '#f39c12'] + # Boroughs
['#85c1e9', '#85c1e9', '#85c1e9', '#85c1e9'] + # Room types (light blue)
['#2ecc71', '#f1c40f', '#e67e22', '#c0392b'] # Price tiers (green to red)
)
# Build source, target, value lists
sources, targets, values, link_colors = [], [], [], []

# Semi-transparent rgba versions of the borough node colors for the links
hex_to_rgba = {
    '#e74c3c': 'rgba(231,76,60,0.4)', '#3498db': 'rgba(52,152,219,0.4)',
    '#2ecc71': 'rgba(46,204,113,0.4)', '#9b59b6': 'rgba(155,89,182,0.4)',
    '#f39c12': 'rgba(243,156,18,0.4)'
}

# Borough → Room Type flows
borough_room = q4_data.groupby(['neighbourhood_group_cleansed', 'room_type']).size().reset_index(name='count')
for _, row in borough_room.iterrows():
    sources.append(node_indices[row['neighbourhood_group_cleansed']])
    targets.append(node_indices[row['room_type']])
    values.append(row['count'])
    # Color each link by its source borough
    borough_idx = boroughs.index(row['neighbourhood_group_cleansed'])
    link_colors.append(hex_to_rgba[node_colors[borough_idx]])
# Room Type → Price Tier flows
room_price = q4_data.groupby(['room_type', 'price_tier']).size().reset_index(name='count')
for _, row in room_price.iterrows():
sources.append(node_indices[row['room_type']])
targets.append(node_indices[row['price_tier']])
values.append(row['count'])
link_colors.append('rgba(200,200,200,0.4)')
# Create Sankey
fig = go.Figure(data=[go.Sankey(
node=dict(
pad=20,
thickness=20,
line=dict(color='black', width=0.5),
label=all_nodes,
color=node_colors
),
link=dict(
source=sources,
target=targets,
value=values,
color=link_colors
)
)])
fig.update_layout(
title_text="How Do NYC Airbnb Listings Flow? — Borough → Room Type → Price Tier (2025)",
font_size=12,
height=600,
width=1000
)
# Display
from IPython.display import display, HTML
fig.write_html("sankey_chart.html")
display(HTML(open("sankey_chart.html").read()))
print("\nThis diagram shows how listings flow from Borough → Room Type → Price Tier")
print(" Wider bands = more listings following that path")
This diagram shows how listings flow from Borough → Room Type → Price Tier
 Wider bands = more listings following that path
Interpretation:
Follow the bands:
- Manhattan → Entire Home → Premium/Luxury — the dominant high-end path
- Brooklyn → Private Room → Budget/Mid-Range — Brooklyn's affordable core
- Outer Boroughs → Budget — almost everything funnels to lower price tiers
The Sankey reveals NYC Airbnb's two pipelines: a Manhattan premium market and an outer-borough budget market.
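The "two pipelines" claim can be quantified directly: the share of each borough's listings landing in each price tier. A toy sketch (hypothetical prices; the tier boundaries are the same ones used in the Sankey):

```python
import pandas as pd

# Toy Q4 listings (hypothetical) for two boroughs
df = pd.DataFrame({
    'borough': ['Manhattan'] * 4 + ['Bronx'] * 4,
    'price_clean': [280.0, 350.0, 190.0, 90.0, 60.0, 70.0, 110.0, 55.0],
})

def price_tier(price):
    # Same tier boundaries as the Sankey diagram
    if price < 75:
        return 'Budget (<$75)'
    elif price < 150:
        return 'Mid-Range ($75-150)'
    elif price < 300:
        return 'Premium ($150-300)'
    return 'Luxury ($300+)'

df['tier'] = df['price_clean'].apply(price_tier)

# Share of each borough's listings per price tier (rows sum to 100%)
share = pd.crosstab(df['borough'], df['tier'], normalize='index') * 100
print(share.round(0))
```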
Graph 14: Who Owns NYC Airbnb?¶
Purpose: Is this "home-sharing" or business?
# Graph 14: Host Concentration by Borough - Horizontal Bars
fig, axes = plt.subplots(2, 3, figsize=(15, 10), dpi=100)
axes = axes.flatten()
boroughs = ['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island']
bar_colors = ['#2ecc71', '#3498db', '#f39c12', '#e74c3c']
labels = ['Individual (1)', 'Small-Scale (2-5)', 'Professional (6-20)', 'Commercial (20+)']
# Store results for summary
summary_data = []
for idx, borough in enumerate(boroughs):
ax = axes[idx]
# Filter by borough
borough_data = df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough]
listings_per_host = borough_data.groupby('host_id')['id'].count()
# Categorize
host_categories = pd.cut(listings_per_host,
bins=[0, 1, 5, 20, float('inf')],
labels=labels)
category_counts = host_categories.value_counts().reindex(labels)
total_hosts = len(listings_per_host)
# Store for summary
individual_pct = (listings_per_host == 1).sum() / total_hosts * 100
summary_data.append({'Borough': borough, 'Individual %': individual_pct, 'Total Hosts': total_hosts})
# Plot horizontal bars
bars = ax.barh(range(len(labels)), category_counts.values, color=bar_colors, edgecolor=bar_colors)
# Add value labels (count + percentage)
for bar, val in zip(bars, category_counts.values):
pct = val / total_hosts * 100
ax.text(val + total_hosts * 0.02, bar.get_y() + bar.get_height()/2,
f'{val:,} ({pct:.0f}%)', ha='left', va='center', fontsize=9, fontweight='bold')
# Borough label - top left inside chart
ax.text(0.02, 0.98, f'{borough}\n({total_hosts:,} hosts)', transform=ax.transAxes,
fontsize=11, fontweight='bold', va='top', ha='left',
bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor='gray', alpha=0.9))
ax.set_xlabel('')
ax.set_xticks([])
ax.set_yticks([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.invert_yaxis()
# Use 6th subplot space for legend
ax_legend = axes[5]
ax_legend.axis('off')
# Vertical legend in the empty space
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=label) for color, label in zip(bar_colors, labels)]
ax_legend.legend(handles=legend_elements, loc='center', frameon=True, fontsize=11,
title='Host Type', title_fontsize=12, facecolor='white', edgecolor='gray')
plt.suptitle('Host Concentration by Borough (2025)',
fontsize=14, fontweight='bold')
plt.tight_layout()
# Add partition lines
fig.add_artist(plt.Line2D([0.33, 0.33], [0.05, 0.95], color='lightgray', linewidth=1, linestyle='--', transform=fig.transFigure))
fig.add_artist(plt.Line2D([0.66, 0.66], [0.05, 0.95], color='lightgray', linewidth=1, linestyle='--', transform=fig.transFigure))
fig.add_artist(plt.Line2D([0.02, 0.98], [0.5, 0.5], color='lightgray', linewidth=1, linestyle='--', transform=fig.transFigure))
plt.show()
Interpretation:
Only 8-14% of hosts have a single listing, i.e., true home-sharers. The majority (66-72%) operate 2-5 listings. Another 15-22% run professional portfolios.
Airbnb in NYC is dominated by small-scale operators, not individual homeowners sharing spare rooms.
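The host categorization in Graph 14 is a `pd.cut` over listings-per-host counts. A minimal sketch with toy host IDs (hypothetical, not the real data):

```python
import pandas as pd

# Toy host/listing pairs (hypothetical): host 3 runs a 6-listing portfolio
df = pd.DataFrame({'host_id': [1, 2, 2, 3, 3, 3, 3, 3, 3, 4],
                   'id': range(10)})
listings_per_host = df.groupby('host_id')['id'].count()

# Same bins and labels as Graph 14: (0,1], (1,5], (5,20], (20,inf)
labels = ['Individual (1)', 'Small-Scale (2-5)', 'Professional (6-20)', 'Commercial (20+)']
categories = pd.cut(listings_per_host, bins=[0, 1, 5, 20, float('inf')], labels=labels)
print(categories.value_counts().reindex(labels))
```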
Graph 15: Availability Patterns and Policy Implications¶
Purpose: How often are listings actually available?
# Graph 15: Availability Tiers Treemap
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
# Create availability tiers
def availability_tier(days):
if days < 30:
return 'Rarely Available (<30 days)'
elif days < 180:
return 'Occasional (30-180 days)'
elif days < 300:
return 'Frequent (180-300 days)'
else:
return 'Full-Time Rental (300+ days)'
# Use Q4 data only (most recent)
q4_data = df_analysis[df_analysis['quarter'] == 'Q4'].copy()
q4_data['availability_tier'] = q4_data['availability_365'].apply(availability_tier)
# Count by tier
tier_counts = q4_data['availability_tier'].value_counts()
# Define colors
tier_colors = {
'Rarely Available (<30 days)': '#95a5a6',
'Occasional (30-180 days)': '#2ecc71',
'Frequent (180-300 days)': '#f39c12',
'Full-Time Rental (300+ days)': '#e74c3c'
}
# Order tiers logically
tier_order = ['Rarely Available (<30 days)', 'Occasional (30-180 days)',
'Frequent (180-300 days)', 'Full-Time Rental (300+ days)']
tier_counts = tier_counts.reindex(tier_order)
colors = [tier_colors[tier] for tier in tier_counts.index]
# Create labels — TWO LINES: Tier name on line 1, count + percentage on line 2
total = tier_counts.sum()
labels = [f'{tier}\n{count:,} ({count/total*100:.1f}%)' for tier, count in tier_counts.items()]
squarify.plot(sizes=tier_counts.values, label=labels, color=colors,
alpha=0.85, edgecolor='white', linewidth=3,
text_kwargs={'fontsize': 11, 'fontweight': 'bold'})
ax.set_title('How Often Are NYC Airbnb Listings Available? — Policy Implications (2025)',
fontsize=14, fontweight='bold')
ax.axis('off')
plt.tight_layout()
plt.show()
# Print insight
fulltime = tier_counts['Full-Time Rental (300+ days)']
occasional = tier_counts['Occasional (30-180 days)']
print(f"Full-time rentals (300+ days): {fulltime:,} listings")
print(f"Occasional rentals (30-180 days): {occasional:,} listings")
Interpretation:
- Full-Time Rentals (300+ days): 43% → housing removed from the long-term market
- Frequent (180-300 days): 29% → semi-professional operations
- Occasional (30-180 days): 26% → closer to the "home-sharing" spirit
- Rarely Available (<30 days): 2%
Nearly half the market operates as full-time rentals. This is the policy concern.
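The full-time share is the fraction of listings falling in the top availability tier. A toy sketch (hypothetical availability values; the tier function is the one from Graph 15):

```python
import pandas as pd

# Toy availability_365 values (hypothetical)
avail = pd.Series([10, 45, 200, 320, 365, 150, 290, 360])

def availability_tier(days):
    # Same boundaries as Graph 15
    if days < 30:
        return 'Rarely Available (<30 days)'
    elif days < 180:
        return 'Occasional (30-180 days)'
    elif days < 300:
        return 'Frequent (180-300 days)'
    return 'Full-Time Rental (300+ days)'

tiers = avail.apply(availability_tier)
fulltime_share = (tiers == 'Full-Time Rental (300+ days)').mean() * 100
print(f"Full-time rental share: {fulltime_share:.0f}%")
```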
Graph 16: Market Growth Over Time¶
Purpose: How have regulations shaped host registrations?
# Graph 16: Time Series - Host Registrations
fig, ax = plt.subplots(figsize=(14, 6))
host_data = df_analysis.drop_duplicates(subset='host_id')[['host_id', 'host_since']].dropna()
host_data['host_month'] = host_data['host_since'].dt.to_period('M')
monthly_hosts = host_data.groupby('host_month').size()
monthly_hosts.index = monthly_hosts.index.to_timestamp()
cumulative_hosts = monthly_hosts.cumsum()
ax.fill_between(monthly_hosts.index, monthly_hosts.values, alpha=0.3, color='steelblue')
ax.plot(monthly_hosts.index, monthly_hosts.values, color='steelblue', linewidth=1, label='New Hosts per Month')
ax2 = ax.twinx()
ax2.plot(cumulative_hosts.index, cumulative_hosts.values, color='darkred', linewidth=2.5, label='Cumulative Hosts')
# Set x-axis limits to the data range to remove empty margins
ax.set_xlim(monthly_hosts.index.min(), monthly_hosts.index.max())
# Add regulation annotations
# 2016 State Law (advertising ban)
ax.axvline(pd.Timestamp('2016-10-21'), color='#e74c3c', linestyle='--', linewidth=2, alpha=0.8)
ax.text(pd.Timestamp('2016-06-01'), monthly_hosts.max() * 0.95,
'NY State\nAdvertising Ban\n(Oct 2016)', fontsize=9, color='#e74c3c', ha='right', fontweight='bold')
# COVID annotation
ax.axvline(pd.Timestamp('2020-03-01'), color='#7f8c8d', linestyle='--', linewidth=2, alpha=0.8)
ax.text(pd.Timestamp('2020-08-01'), monthly_hosts.max() * 0.75,
'COVID-19\n(Mar 2020)', fontsize=9, color='#7f8c8d', ha='left', fontweight='bold')
# Local Law 18 enforcement (2023)
ax.axvline(pd.Timestamp('2023-09-05'), color='#8e44ad', linestyle='--', linewidth=2, alpha=0.8)
ax.text(pd.Timestamp('2023-09-05'), monthly_hosts.max() * 0.55,
'Local Law 18\nEnforced\n(Sep 2023)', fontsize=9, color='#8e44ad', ha='left', fontweight='bold')
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('New Hosts per Month', fontsize=12, color='steelblue')
ax2.set_ylabel('Cumulative Hosts', fontsize=12, color='darkred')
ax.set_title('NYC Airbnb Market Growth: Host Registrations Over Time', fontsize=14)
lines1, labels1 = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax.legend(lines1 + lines2, labels1 + labels2, loc='upper left')
# Light dotted gridlines
ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.4, color='gray', which='major')
ax2.grid(False)
# Clean spines
ax.spines['top'].set_visible(False)
ax2.spines['top'].set_visible(False)
plt.tight_layout()
plt.show()
print(f"\nTotal unique hosts: {len(host_data):,}")
print(f"Earliest host registration: {host_data['host_since'].min().strftime('%Y-%m-%d')}")
Total unique hosts: 13,677
Earliest host registration: 2008-08-11
Interpretation:
The time series reveals the lifecycle of NYC's Airbnb market with four inflection points:
1. Growth Phase (2008-2016): Exponential growth from Airbnb's founding, peaking at ~140 new hosts/month.
2. NY State Advertising Ban (Oct 2016): New York banned advertising short-term rentals that violate the Multiple Dwelling Law. New host registrations began declining immediately.
3. COVID-19 Shock (Mar 2020): Travel restrictions caused a sharp but temporary decline. Recovery began mid-2021.
4. Local Law 18 Enforcement (Sep 2023): NYC began requiring registration for all short-term rental hosts. Platforms like Airbnb were prohibited from listing unregistered rentals.
Key Finding: NYC's layered regulations successfully constrained market growth. New host registrations never recovered to pre-2016 levels.
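The before/after effect of a regulation date can be summarized by comparing mean monthly registrations on either side of the cutoff. A toy sketch (the monthly counts below are hypothetical; only the Oct 2016 cutoff date comes from the analysis):

```python
import pandas as pd

# Toy monthly registration series (hypothetical): a step drop after the ban
idx = pd.date_range('2015-01-01', '2018-12-01', freq='MS')
counts = pd.Series([120] * 22 + [60] * 26, index=idx)

cutoff = pd.Timestamp('2016-10-21')  # NY State advertising ban
before = counts[counts.index < cutoff].mean()
after = counts[counts.index >= cutoff].mean()
print(f"Mean new hosts/month before ban: {before:.0f}, after: {after:.0f}")
```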
Graph 17: Where Are the Listings?¶
Purpose: Interactive geographic view of the market.
# Graph 17: Folium Interactive Map
oct_data = df_analysis[df_analysis['quarter'] == 'Q4']
map_sample = oct_data.sample(n=min(1500, len(oct_data)), random_state=42)
nyc_center = [40.7128, -74.0060]
m = folium.Map(location=nyc_center, zoom_start=11, tiles='CartoDB positron')
borough_map_colors = {
'Manhattan': 'red',
'Brooklyn': 'blue',
'Queens': 'green',
'Bronx': 'purple',
'Staten Island': 'orange'
}
marker_cluster = MarkerCluster().add_to(m)
for idx, row in map_sample.iterrows():
color = borough_map_colors.get(row['neighbourhood_group_cleansed'], 'gray')
if pd.notna(row.get('review_scores_rating')):
rating_text = f"{row['review_scores_rating']:.2f}"
else:
rating_text = "N/A"
popup_text = f"""<b>{str(row['name'])[:40]}...</b><br>
Price: ${row['price_clean']:.0f}/night<br>
Borough: {row['neighbourhood_group_cleansed']}<br>
Room Type: {row['room_type']}<br>
Rating: {rating_text}"""
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=4,
color=color,
fill=True,
fillColor=color,
fillOpacity=0.6,
popup=folium.Popup(popup_text, max_width=300)
).add_to(marker_cluster)
print("Interactive Map: NYC Airbnb Listings (October 2025)")
print("(Displaying 1,500 sampled listings with clustering)")
# Save and display inline for VSCode
from IPython.display import display, HTML
display(HTML(m._repr_html_()))
Interactive Map: NYC Airbnb Listings (October 2025)
(Displaying 1,500 sampled listings with clustering)
Interpretation:
Listings cluster in Manhattan below Central Park, along Brooklyn's waterfront (Williamsburg, DUMBO), and near transit hubs in Queens. Staten Island is nearly empty.
Graph 18: 3D Exploration — Price, Reviews, Availability¶
Purpose: Interactive multi-dimensional exploration.
# Graph 18: 3D Scatter Plot with Plotly
scatter3d_data = df_analysis[
(df_analysis['number_of_reviews'] > 0) &
(df_analysis['review_scores_rating'].notna())
].sample(n=min(3000, len(df_analysis)), random_state=42)
fig = px.scatter_3d(
scatter3d_data,
x='availability_365',
y='number_of_reviews',
z='price_clean',
color='neighbourhood_group_cleansed',
symbol='room_type',
size='review_scores_rating',
size_max=10,
opacity=0.6,
hover_name='name',
hover_data={'price_clean': ':$.0f', 'neighbourhood_group_cleansed': True},
color_discrete_map={'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71',
'Bronx': '#9b59b6', 'Staten Island': '#f39c12'},
title='3D Analysis: Availability × Reviews × Price by Borough'
)
fig.update_layout(
scene=dict(
xaxis_title='Availability (Days/Year)',
yaxis_title='Number of Reviews',
zaxis_title='Price ($/Night)'
),
height=700,
width=950
)
from IPython.display import display, HTML
fig.write_html("graph15_3d.html")
display(HTML(open("graph15_3d.html").read()))
Interpretation:
The 3D scatter allows exploration of three variables simultaneously — availability, reviews, and price.
What stands out:
- Manhattan dominates the upper price range: red dots float higher on the price axis
- No clear pattern between reviews and price: the cloud is dispersed
- Availability varies widely across all price points
Key Takeaway: The lack of visible structure confirms that price is driven primarily by location (borough), not by review count or availability. This is an exploratory view; rotate the chart to discover patterns.
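"Location drives price" can be quantified as the share of price variance explained by borough (an eta-squared-style decomposition). A toy sketch with equal-sized groups (hypothetical prices):

```python
import pandas as pd

# Toy prices by borough (hypothetical): tight within groups, far apart between them
df = pd.DataFrame({
    'borough': ['Manhattan'] * 3 + ['Queens'] * 3,
    'price_clean': [200.0, 220.0, 210.0, 100.0, 90.0, 110.0],
})

# With equal group sizes, 1 - mean(within-group var) / total var
# equals the between-group share of variance
overall = df['price_clean'].var(ddof=0)
within = df.groupby('borough')['price_clean'].var(ddof=0).mean()
explained = 1 - within / overall
print(f"Share of price variance explained by borough: {explained:.0%}")
```

A value near 1 means knowing the borough nearly pins down the price, consistent with the borough separation visible in the scatter plots.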
# Summary Statistics
print("=" * 70)
print("NYC AIRBNB MARKET SUMMARY - QUARTERLY ANALYSIS 2025")
print("=" * 70)
print(f"\n📊 COMBINED DATASET")
print(f" Total Listings Analyzed: {len(df_analysis):,}")
for season in ['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)']:
count = len(df_analysis[df_analysis['season'] == season])
print(f" {season}: {count:,}")
print(f" Unique Hosts: {df_analysis['host_id'].nunique():,}")
print(f"\n💰 SEASONAL PRICING TRENDS")
for season in ['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)']:
subset = df_analysis[df_analysis['season'] == season]
print(f" {season}: Mean ${subset['price_clean'].mean():.0f} | Median ${subset['price_clean'].median():.0f}")
q1_med = df_analysis[df_analysis['quarter'] == 'Q1']['price_clean'].median()
q4_med = df_analysis[df_analysis['quarter'] == 'Q4']['price_clean'].median()
pct_change = ((q4_med - q1_med) / q1_med * 100)
print(f"\n 📈 Winter → Fall Price Change: {pct_change:+.1f}%")
print(f"\n🏙️ BOROUGH BREAKDOWN")
for borough in df_analysis['neighbourhood_group_cleansed'].value_counts().index:
count = len(df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough])
median = df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough]['price_clean'].median()
pct = count / len(df_analysis) * 100
print(f" {borough}: {count:,} ({pct:.1f}%) - Median ${median:.0f}")
print(f"\n🏠 ROOM TYPE BREAKDOWN")
for room_type, count in df_analysis['room_type'].value_counts().items():
pct = count / len(df_analysis) * 100
print(f" {room_type}: {count:,} ({pct:.1f}%)")
print(f"\n⭐ QUALITY METRICS")
print(f" Average Rating: {df_analysis['review_scores_rating'].mean():.2f}/5")
print(f" Superhosts: {df_analysis['is_superhost'].sum():,} ({df_analysis['is_superhost'].mean()*100:.1f}%)")
======================================================================
NYC AIRBNB MARKET SUMMARY - QUARTERLY ANALYSIS 2025
======================================================================

📊 COMBINED DATASET
 Total Listings Analyzed: 87,050
 Q1 Winter (Jan): 22,899
 Q2 Spring (Apr): 22,050
 Q3 Summer (Jul): 21,091
 Q4 Fall (Oct): 21,010
 Unique Hosts: 13,678

💰 SEASONAL PRICING TRENDS
 Q1 Winter (Jan): Mean $182 | Median $132
 Q2 Spring (Apr): Mean $205 | Median $144
 Q3 Summer (Jul): Mean $211 | Median $150
 Q4 Fall (Oct): Mean $218 | Median $151

 📈 Winter → Fall Price Change: +14.4%

🏙️ BOROUGH BREAKDOWN
 Manhattan: 38,331 (44.0%) - Median $200
 Brooklyn: 30,338 (34.9%) - Median $125
 Queens: 13,605 (15.6%) - Median $100
 Bronx: 3,521 (4.0%) - Median $91
 Staten Island: 1,255 (1.4%) - Median $100

🏠 ROOM TYPE BREAKDOWN
 Entire home/apt: 48,789 (56.0%)
 Private room: 36,946 (42.4%)
 Hotel room: 809 (0.9%)
 Shared room: 506 (0.6%)

⭐ QUALITY METRICS
 Average Rating: 4.73/5
 Superhosts: 24,422 (28.5%)
Hypothesis Evaluation¶
Original: "NYC Airbnb prices increase significantly from winter to summer, with Manhattan showing the strongest seasonal variation."
Verdict: CONFIRMED ✅
- Overall median prices rose 13.6% from winter to summer (Q1 $132 → Q3 $150) and held at $151 through fall (+14.4% over Q1)
- Manhattan drove the increase: +34% ($168 → $225)
- Outer boroughs stayed flat (±4%)
Key Findings¶
| # | Finding |
|---|---|
| 1 | Two-Tier Market: Manhattan = dynamic hotel-style pricing. Outer boroughs = stable residential rents. |
| 2 | Location > Everything: Borough determines price. Reviews and ratings barely matter. |
| 3 | Capacity Drives Price: More beds/bedrooms = higher prices within boroughs. |
| 4 | Not "Home-Sharing": Only 8-14% of hosts have one listing. Most run portfolios. |
| 5 | 43% Full-Time Rentals: Nearly half of listings = housing removed from long-term market. |
| 6 | Regulations Work: 2016 and 2023 laws both reduced market growth. |
| 7 | Neighborhood Variation: Within Manhattan, Midtown surges +58% while Harlem stays flat. |
Recommendations¶
For Travelers:
- Book Manhattan in Q1 (January) — save up to 34%
- Consider Brooklyn/Queens — 40-50% cheaper, no seasonal penalty
For Hosts:
- Manhattan: Implement aggressive seasonal pricing
- Outer boroughs: Focus on capacity and amenities
For Policymakers:
- Neighborhood-specific rules may work better than borough-wide regulations
- The 43% full-time rental rate warrants continued attention
References¶
- Data: Inside Airbnb (insideairbnb.com) — NYC Q1-Q4 2025 snapshots
- Regulatory Sources: NYC OSE - Local Law 18
- Tools: Python (pandas, matplotlib, seaborn, plotly, folium, squarify)
- AI Assistance: Claude (Anthropic) — code development support
End of Analysis